Abstract

We report on a language resource consisting of 2000 annotated bibliography entries, which is being analyzed as part of our research on 
indicative document summarization. We show how annotated bibliographies cover certain aspects of summarization that have not been 
well-covered by other summary corpora, and motivate why they constitute an important form to study for information retrieval. We detail 
our methodology for collecting the corpus, and overview our document feature markup that we introduced to facilitate summary analysis. 
We present the characteristics of the corpus and its methods of collection, and show its use in finding the distribution of types of information 
included in indicative summaries and their relative ordering within the summaries. 



1. Introduction 

Automatic text summarization has largely been synonymous
with domain-independent, sentence extraction techniques
(for an overview, see Paice (1990)). These approaches
have used a battery of indicators such as cue
phrases, term frequency, and sentence position to choose 
sentences to extract and form into a summary. An alter- 
native approach is to collect sample summaries and apply 
machine learning techniques to identify what types of infor- 
mation are included in a summary, and identify their stylis- 
tic, grammatical, and lexical choice characteristics and to 
generate or regenerate a summary based on these charac- 
teristics. In this paper, we examine the first step towards 
this goal: the collection of an appropriate summary cor- 
pus. We focus on annotated bibliography entries, because 
they are written without reliance on sentence extraction. 
Furthermore, these entries contain both informative (i.e.,
details and topics of the resource) as well as indicative (e.g.,
metadata such as author or purpose) information. We believe
that summary texts similar in form to annotated bibliography
entries, such as the one shown in Figure 1, can
better serve users and replace the standard top-n sentence or
query-word-in-context summaries commonly found in current
generation search engines.

Our corpus of summaries consists of 2000 annotated 
bibliography entries collected from various Internet web- 
sites using search engines. We first review aspects and di- 
mensions of text summaries, and detail reasons for collect- 
ing a corpus of annotated bibliography entries. We follow 
with details on the collection methodology and a descrip- 
tion of our annotation of the entries. We conclude with 
some current applications of the corpus to automatic text 
summarization research. 



2. Dimensions of summarization 

With the current widespread language resources that are 
available on the web, constructing a large corpus of docu- 
ment summaries is becoming easier. However, document 
summaries have many different aspects and purposes (Mani
and Maybury (1999), introduction), and thus it is impor- 
tant to clarify which aspects of summarization a collection 



Maxwell, S. E., Delaney, H. D., & O'Callaghan, M. F. 
(1993). Analysis of covariance. In L. K. Edwards (Ed.), 
Applied analysis of ... 

This paper gives a brief history of ANCOVA, and then 
discusses ANCOVA in the context of the general linear 
model. The authors then provide a numerical example, 
and discuss the assumptions of ANCOVA. Then four 
advanced topics are covered: ... This paper is quite 
theoretical and complex, but contains no matrix alge- 
bra. 



Figure 1: Sample excerpt from an annotated bibliography 
entry. 

covers. We briefly examine several different dimensions of 
summaries. 

• Extract versus Abstract - Summaries that are con- 
structed by extracting important passages, sentences 
or phrases from the source document are considered 
extracts. In contrast, an abstract may or may not con- 
tain words in common with the document. Authors 
using abstractive techniques are not as constrained 
as those using extractive ones, and can summarize a 
wider range of materials effectively (e.g., narratives)
and often with smaller amounts of text. 

• Informative versus Indicative - Informative sum- 
maries attempt to include all important points of the 
document in the summary. Examples include book re- 
ports or scientific abstracts of technical articles. In- 
dicative summaries hint at the topics of the document, 
and do not serve as any type of surrogate for the source 
document. From an information retrieval perspective, 
we can think of the indicative summary as text that 
helps a user to decide whether they should consider 
retrieving the full text of the source document. Examples
of indicative summaries include annotated bibliography
entries and library card catalog entries.

• Generic versus Query-based - Summaries that treat 
all topics of a source document with equal weight are 



generic summaries, whereas a query-based summary 
gives particular attention to a specific facet of the doc- 
ument. While library card catalog entries are generic 
summaries, annotated bibliography entries that are 
part of a themed collection (e.g., "Books about Me- 
dieval Arms and Armor") are often biased towards the 
collection's topic, and may highlight or only mention 
information relating to its theme. 

• Single Document versus Multidocument - Multi- 
document summaries typically summarize a set of 
documents that are related in some fashion. Current 
multidocument summary techniques have focused on 
articles provided by different sources, or which are 
updates of previous articles on an event (Radev and 
McKeown, 1998). 

3. Related work in summary corpora 

With these dimensions of text summarization in mind, 
we can discuss different existing summary corpora, and 
show how they relate to these particular dimensions. This 
is shown in Table 1.

3.1. News summaries 

The Document Understanding Conference (DUC) was 
first held in 2001, sponsored by the National Institute
of Standards and Technology (NIST) (Harman and Marcu,
2001). It is a competition in the "bake-off" style which 
pits systems against each other in summarizing the same set 
of input documents. For the first DUC competition, train- 
ing corpora of sample input documents and sample sum- 
maries were provided by NIST in consultation with the re- 
search community. Both single document and multidocu- 
ment generic summaries were made available to groups to 
train 15 different summarization systems. The DUC summary
corpus was constructed by both extractive and abstractive
techniques, and tends to be informative rather than
indicative.



Jing and McKeown (1999) also exploited the relation between
source documents and target summaries in their use of Hidden
Markov Models for summarization. Their "cut and
paste" method was demonstrated on the Ziff-Davis sum- 
mary corpus of computer peripheral review articles. The 
Ziff-Davis summary corpus is a single document corpus 
that is generic and mostly extract-based. 

3.2. Scientific summaries 

There have been a number of studies using abstracts 
of scientific articles as a target summary. The work of Kupiec,
Pedersen and Chen (1995) is an instance of this, where
they use 188 Engineering Information summaries that are
mostly indicative in nature. Abstracts tend to summarize
the document's topics well but make little use of
metadata, which is of interest to our study, as further explained
in Section 4.

3.3. Snippets 



Snippets (Amitay, 2000) are short, textual descriptions
that authors of web pages provide to give an indicative description
of a hyperlinked document. These snippets are



often very short, as in the case of the descriptions con- 
nected to Yahoo! or Open Directory Project (ODP) cate- 
gory pages. Amitay describes strategies for locating and 
extracting snippets from various types of web pages, and 
applies machine learning to rank different snippet descriptions
of the same document for fitness as a document summary.

This solution only works for resources that have ex- 
isting snippets. Newly-authored documents (of interest to 
people trying to keep current) cannot benefit from past snip- 
pets, since they refer to different resources. Amitay's work 
lays the foundation for building the tools to collect such a 
snippet corpus, but unfortunately does not provide a publicly
available tool or corpus.

3.4. Card catalog summaries 

Library card catalog entries in the physical library (and 
their electronic, machine-readable record counterparts in the
automated library) also provide indicative summaries of resources.
Our preliminary study (Kan et al., 2001) examined
these resources to get a first-round approximation of
the contents of indicative summaries. Library catalog en- 
tries consist of structured fields, of which a summary is 
an optional field. These summary fields are often provided 
by third-party vendors who may not be aware of the other 
fields present in the catalog. In our local online catalog, 
other types of information (such as notes, or book jacket 
texts, or book reviews) were often substituted for sum- 
maries. 

4. Annotated bibliography entries 

Broadly speaking, our research focuses on how auto- 
matic text summarization techniques can be applied to un- 
derstanding search engine results. Our goal is not to an- 
alyze what makes one summary better than another, but 
to learn how to generate a suitable summary of a resource 
based on machine learning over a compiled corpus. A "suit- 
able" annotation can span many different dimensions, but in 
our case mainly concerns space/length limitations. Current 
standard technology presents search results as a ranked list of
10 or 20 document "hits", accompanied by short extract 
summaries. An alternative approach is to present the docu- 
ments with more meaningful summaries that explicitly as- 
sist the user in choosing a document to examine or in de- 
ciding that none of the retrieved documents are useful. 

To fulfill this purpose, query-based indicative sum- 
maries constructed by abstractive techniques are most rel- 
evant. We believe abstracts are more powerful than ex- 
tracts because they have the capability to yield more con- 
cise and accurate summaries. Similarly, indicative summa- 
rization is an equally important facet, as it provides sum- 
maries tailored to our information retrieval application, in 
which source documents are readily available. For these 
reasons, neither the DUC nor the Ziff-Davis corpus is well
suited to our study. Scientific abstracts and library card catalog
summaries are largely generic and thus do not give us
an opportunity to study query-based summarization. The
study of snippets most closely aligns with the purpose of
our study, but neither a compiled corpus of snippets nor a tool
for locating them is publicly available.

Table 1: Sample summary corpora types mentioned in this paper.



Instead, we examined a different class of summary 
texts, the annotated bibliography entry. Annotated bibli- 
ographies are created mostly by abstractive methods and 
include both indicative and informative forms. An anno- 
tated bibliography entry is a summary of a book or other 
resource that annotates a resource with a description of the 
text, as shown in Figure 1.

From our empirical observations of annotated bibliography
entries, snippets and library card catalog entries,
bibliography entries have some unique features that make 
them attractive and challenging to process. Bibliography 
entries often: 

• are lengthier than both card catalog summaries and 
snippets. They often exhibit more variation of sen- 
tence structure and lexical choice. This makes the sub- 
sequent analyses rich and allows (re)generation based 
on these analyses to construct more varied and inter- 
esting text. 

• are organized around a theme, making them an ideal standard
for "query-based" summaries. Bibliography entries
also have more explicit comparison of one resource
versus another, which can help a user determine
which document to choose for a particular
purpose.

• have prefacing text that overviews the documents in 
the bibliography. This preface text is a good model 
for summarizing a set of related items (e.g., different 
books on arms and armor or different earthquake reports
in 1992). This is in contrast to multidocument
summaries that summarize articles with mostly over- 
lapping information (news reports on a single event 
and updates to the event). 

• are rich in meta-information document features — they 
often mention edition, title, author and purpose. These 
document features are not always present in or in- 
ferrable from the body text of a source document. Our 
previous study of library card catalog entries showed 
that these document features are well represented (and 
thus important). 

The construction of annotated bibliographies is a well- 
established field in information science studies. Thus, the 
form has many descriptive guidelines that we examined that 
validate the above observations. Writing guides such as
(Rees, 1970; Engle et al., 1998; Lester, 2001; Anne Arundel
Community College, 1998; Williams, 2002) indicate



specific types of information that should be included in annotated
bibliographies; these are synopsized in Table 2.

Table 2: Prescribed features of annotated bibliographies 
from several sources 

These resources are all guidelines for the content of annotated
bibliography entries. The guidelines are prescriptive,
and thus it is important to validate them by examining
actual annotated bibliographies, both a) to see whether the
guidelines on content are followed, and b) to establish the
content's ordering and grammatical structure.



5. Annotated bibliography language resource

Our language resource of annotated bibliography en- 
tries was designed to ease the collection of the corpus as 
well as to make many features available for subsequent 
analysis for summarization and related natural language ap- 
plications. 

5.1. Collection methodology 

The collection of the bibliography entries was done by 
spidering search result pages from two search engines (Al- 
taVista and Google) for the keywords "annotated bibliog- 
raphy". The collection was compiled in September 2001 
and software filters were written to parse and retrieve the 
contained URLs from each site (200 from AltaVista and an 
additional 1000 from Google). By our estimates, roughly 
60% of the pages that were gathered had errors in retrieval 
(e.g., were stale URLs), were duplicate entries, or did not 
contain bibliographic entries. This leaves approximately
500 pages with actual bibliographic entries to draw from.
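
The page estimate above can be checked with a few lines of arithmetic. This is only a sketch: the URL counts and the 60% figure come from the text, while the variable names are illustrative.

```python
# Rough reconstruction of the usable-page estimate described in the text.
urls_altavista = 200     # URLs retrieved from AltaVista (from the text)
urls_google = 1000       # additional URLs retrieved from Google (from the text)
unusable_percent = 60    # rough share of stale, duplicate, or irrelevant pages

total_urls = urls_altavista + urls_google
usable_pages = total_urls * (100 - unusable_percent) // 100

print(total_urls, usable_pages)
```

The result is consistent with the rounded figure of roughly 500 pages.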

An examination of the materials in these remaining documents
revealed that most pages were organized around a specific
purpose, and varied greatly in collection size. Most



common were large collections of 20 to 100 entries and in- 
troductory pages to even larger collections (over 1000 en- 
tries). Pages that only annotated a few items were much 
less common; we suspect that this is due to the inherent 
bias of the search engine ranking metric to rank sites that 
are more prominent (which we believe is highly correlated 
with larger collections). The smaller collections were of- 
ten a part of a larger website or were the last section of a 
larger webpage on the topic of interest. With this struc- 
ture in mind, we decided to take at most 50 entries from 
each source document to ensure that we covered a breadth 
of annotated bibliography entry sources in collecting the fi- 
nal corpus. We examined the documents in order of their 
appearance on the AltaVista hitlist, and as a result, only a 
total of 64 documents from the AltaVista spidered collec- 
tion were used to create the 2000-entry corpus. If all of the 
bibliographic entries were extracted from the documents, 
it would easily exceed 20,000 entries in size (as many of 
the collections had many more than 50 entries). Documents
spidered from Google have so far not been processed and
added to the bibliography collection; we plan to include the 
processing of these documents and other sources as future 
project time allows. 
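
The per-source cap can be sketched as follows. In the actual collection the entries were cut and pasted by hand, so this is only an illustration of the selection policy; the function name, the dictionary layout, and the 1-based offset are assumptions.

```python
from typing import Dict, List

MAX_PER_SOURCE = 50   # cap per source document, as stated in the text
TARGET_SIZE = 2000    # final corpus size, as stated in the text


def select_entries(pages: List[Dict], cap: int = MAX_PER_SOURCE,
                   target: int = TARGET_SIZE) -> List[Dict]:
    """Walk pages in hitlist order, taking at most `cap` entries from each."""
    corpus: List[Dict] = []
    for page in pages:
        # Offsets are numbered from 1 here; this numbering is an assumption.
        for offset, text in enumerate(page["entries"][:cap], start=1):
            corpus.append({"url": page["url"], "offset": offset, "text": text})
            if len(corpus) == target:
                return corpus
    return corpus


# Hypothetical toy input: one large page and one small page.
pages = [
    {"url": "http://example.org/a", "entries": ["entry"] * 120},
    {"url": "http://example.org/b", "entries": ["entry"] * 10},
]
selected = select_entries(pages)
print(len(selected))
```

Capping each source trades corpus depth for breadth: no single bibliography can dominate the distribution of styles in the sample.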

5.2. Encoding the XML bibliographic entry corpus 

Bibliography entries from the 64 spidered pages were 
then manually cut-and-pasted into the corpus collection 
web interface. This was both to ensure that the entries were 
being correctly delimited, and to add fields to each entry 
that may assist in future analysis and serve as a gold stan- 
dard for future machine learning tasks. The corpus is en- 
coded in XML and includes the following fields in addition 
to the bibliographic entry itself. 

• Subject: the subject or theme of the annotated bibliog- 
raphy page. 

• Domain: annotated to help distinguish features that are
domain-independent from ones that
are domain-dependent. We encode the domain rather
coarsely (e.g., all of medicine as a single domain) 
and in an ad-hoc manner without the assistance of an 
ontology. Finer granularity is provided by the above 
subject field. 

• Micro Collection (optional): the internal division in 
the bibliography page that the entry is a part of (e.g., 
"reference books" section of a bibliography on the 
colonial times in Jamestown). 

• Macro Collection (optional): the division that the 
physical bibliography page represents in the set of re- 
lated bibliography pages (e.g., "all colonies in colonial 
times in the U.S." with respect to the last example). 
The macro collection field is used when the physical
bibliography page relates itself to other physical pages.
In our observations, only very large collections exhibit
both micro and macro collection attributes. Figure 2
illustrates the relation of these two attributes.

• Offset: the position of the entry on the page. 



• Before Context: text before the body of the annotated 
entry itself. This often contains cataloging and biblio- 
graphic information, such as the title, author, and call 
number.¹

• After Context (optional): text that is distinctly marked 
off as coming after the body of the annotated entry. 
Used sometimes to mark publisher information, web 
URLs and pointers to other resources. Information 
that typically is contained in this field in one docu- 
ment may simply be appended to the end of the bib- 
liographic entry in other documents; this distinction 
may be more of a stylistic one. 

• URL: the location of the source document where the 
entry was drawn from. 
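
As a rough illustration of the fields listed above, the following sketch models one entry as a record and serializes it to XML. The names bibEntry, beforeContext, microCollection, domain, offset and url follow the sample entry shown in this paper; subject, macroCollection and afterContext are assumed names, and the class itself is hypothetical rather than part of the distributed tools.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET


@dataclass
class BibEntry:
    entry_id: str
    subject: str
    domain: str
    offset: int
    url: str
    before_context: str
    entry: str
    micro_collection: Optional[str] = None   # optional field
    macro_collection: Optional[str] = None   # optional field
    after_context: Optional[str] = None      # optional field

    def to_xml(self) -> ET.Element:
        """Serialize to a bibEntry element; optional fields are omitted if unset."""
        attrs = {"id": self.entry_id, "subject": self.subject,
                 "domain": self.domain, "offset": str(self.offset),
                 "url": self.url}
        if self.micro_collection is not None:
            attrs["microCollection"] = self.micro_collection
        if self.macro_collection is not None:
            attrs["macroCollection"] = self.macro_collection  # assumed name
        root = ET.Element("bibEntry", attrs)
        ET.SubElement(root, "beforeContext").text = self.before_context
        ET.SubElement(root, "entry").text = self.entry
        if self.after_context is not None:
            ET.SubElement(root, "afterContext").text = self.after_context  # assumed name
        return root


example = BibEntry(entry_id="id26", subject="Statistics bibliography",
                   domain="Statistics", offset=4,
                   url="http://example.org/biblio.html",
                   before_context="Maxwell, S. E., et al. (1993).",
                   entry="This paper gives a brief history of ANCOVA.",
                   micro_collection="Analysis of Covariance")
element = example.to_xml()
print(ET.tostring(element, encoding="unicode"))
```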



[Figure: Macro Collection / Website: "Bibliography of resources on the colonial times in the United States"; Title: Jamestown; Micro Collection: Reference books]

Figure 2: Relation of micro and macro collection attributes

To facilitate our local analysis of the corpus, all of the 
bibliographic entries have also been parsed with a proba- 
bilistic dependency parser (Collins, 1996). These parsed 




entries are also included in the XML corpus, as a separate
XML field attached to each entry (the parsedEntry
field). Figure 3 shows a sample entry after it has been
parsed into our XML format.

<bibEntry id="id26" title="Analysis of covariance"
  url="http://www.math.yorku.ca/SCS/biblio.html"
  type="paper" domain="Statistics"
  microCollection="Analysis of Covariance"
  offset="4">
<beforeContext>
Maxwell, S. E., Delaney, H. D., & O'Callaghan,
M. F. (1993). Analysis of ...
</beforeContext>
<entry><OVERVIEW>This <MEDIATYPES>paper</MEDIATYPES>
gives a brief history of ANCOVA, and then discusses
ANCOVA in ... contains no matrix
algebra.</DIFFICULTY>
</entry>
<parsedEntry>
PROB 14659 -112.252 TOP -112.252 S -105.049 NP-A
-8.12201 NPB -7.82967 DT This NN paper ...
</parsedEntry>
</bibEntry>



Figure 3: Portion of the annotated bibliographic entry from
Figure 1, represented as structured fields in our XML corpus.
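
A reader of the corpus might recover the attributes, the plain entry text, and the inline document feature tags along the following lines. This is a sketch using Python's standard xml.etree.ElementTree module; the sample string is a shortened, well-formed stand-in for the excerpt in Figure 3 (the elided portions are dropped and the ampersand is escaped), not an actual corpus record.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the Figure 3 excerpt (an assumption).
SAMPLE = """<bibEntry id="id26" title="Analysis of covariance"
    type="paper" domain="Statistics"
    microCollection="Analysis of Covariance" offset="4">
  <beforeContext>Maxwell, S. E., Delaney, H. D., &amp; O'Callaghan, M. F. (1993).</beforeContext>
  <entry><OVERVIEW>This <MEDIATYPES>paper</MEDIATYPES> gives a brief history of ANCOVA.</OVERVIEW></entry>
</bibEntry>"""

root = ET.fromstring(SAMPLE)
entry = root.find("entry")

# Concatenate all text nodes to recover the untagged entry text.
plain_text = "".join(entry.itertext())

# Collect the document feature tags in document order, skipping <entry> itself.
features = [el.tag for el in entry.iter() if el is not entry]

print(root.get("domain"), int(root.get("offset")))
print(plain_text)
print(features)
```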

5.3. Semantic annotation of document features 

To perform a detailed study of what information is nor- 
mally present in annotated bibliographic entries, we needed 



¹ Currently, this is saved as an unstructured text field. It would 
be best to parse these entries into structured fields but our focus 
is on the text and content of the entries themselves, and not these 
auxiliary fields. 



to inventory the different document features (types of in- 
formation) used in the entries. We re-used our original 14 
document features used in our earlier work on library card 
catalog entries (as mentioned in Section 3.4) and further en- 
riched the feature set to include additional tags that better 
represent the range of information we found in the annotated
bibliography entries. We also took into account annotated
bibliography guidelines, as mentioned in Section 4.
We randomly picked 100 of the 2000 entries to annotate
using this scheme. Table 3 shows the expanded set of 24
document features used in the markup.
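
Drawing the annotation subset can be sketched as simple random sampling without replacement. The identifier scheme and the fixed seed are assumptions made so that the example is reproducible; the text does not say how the random choice was actually made.

```python
import random

# Hypothetical identifiers for the 2000 corpus entries.
entry_ids = [f"id{i}" for i in range(1, 2001)]

rng = random.Random(0)                       # fixed seed: an assumption
to_annotate = rng.sample(entry_ids, k=100)   # sampling without replacement

print(len(to_annotate), len(set(to_annotate)))
```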

6. Corpus attributes 

Table 3 also lists distributional features of the tagged
document features in the 100 annotated entries. The first
column shows the number of times that the annotated feature
was used to mark information in the entries. The second
column gives the percentage of entries that have an
instance of the feature in question. Features were marked
at the sentence level or on smaller units. The columns are 
highly correlated, and show that multiple occurrences of the 
same tag within an entry happen quite frequently. 
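
The two columns described above can be computed as in the following sketch, which assumes each annotated entry has been reduced to the list of feature tags it contains. The function name and the toy data are illustrative only and do not reproduce the figures in Table 3.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tag_statistics(entries: List[List[str]]) -> Dict[str, Tuple[int, float]]:
    """Return, per tag, (number of occurrences, % of entries containing it)."""
    occurrences: Counter = Counter()
    entries_with_tag: Counter = Counter()
    for tags in entries:
        occurrences.update(tags)            # counts every instance
        entries_with_tag.update(set(tags))  # counts each entry at most once
    n = len(entries)
    return {tag: (occurrences[tag], 100.0 * entries_with_tag[tag] / n)
            for tag in occurrences}


# Hypothetical toy annotation of four entries.
toy = [["OVERVIEW", "DETAIL", "DETAIL"], ["OVERVIEW", "TOPIC"],
       ["DETAIL"], ["OVERVIEW"]]
stats = tag_statistics(toy)
print(stats)
```

When the first number exceeds the number of entries that contain the tag, the tag is repeated within single entries, which is the pattern noted above.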

We divided the features into topically related and unre- 
lated features. We distinguish between three different top- 
ically related features. Overview sentences usually begin 
the annotated bibliography entry and include a high level 
overview of the content of the resource. They appear in 
a majority of annotated bibliography entries and generally 
are limited to a single sentence. Topic features give a list 
of topics treated by the source, as an itemized or comma- 
delimited list. Detail sentences represent all other general 
item-specific sentences. In our observations across the 100 
entries that we annotated, these sentences were the most 
variable. Short entries tended not to have any detail sen- 
tences, but as we examined entries of longer length, mostly 
details were being added. 

The data validates both prescriptive guidelines and our 
earlier work in showing that metadata fields (marked with 
stars in Table 3) are important for summaries. Audience
information, recommended by four of the five prescriptive
guidelines, was shown to appear 12% of the time. Other
metadata fields, such as purpose, navigation/internal struc- 
ture, subjective assessment, and readability also play im- 
portant roles. 

A noticeable difference from our earlier work on
card catalog entries is that the title field does not appear
in any of the annotated bibliography entries. We surmise
this is because its mention would be redundant, as the title
is always given as text in the beforeContext XML
field. However, this is not true of author information, as the
document feature is often used to present the credentials of
the author. In contrast, library card catalog entries did exhibit
the title field quite often. We feel that this is because
card catalog summaries were often book jacket or other re- 
lated standalone texts that may not have easy access to the 
bibliographic information. 

Table 4 shows how the distribution of the 24 document
features varies with length and indicates where the features
occur within the summary. The numbers between 0 and 1 in



parentheses indicate how close the average instance of the
document feature is to the beginning (0) of the summary en- 
try or to the end (1). Middle range numbers (e.g., .50) often 
indicate that the field occurred widely across different po- 
sitions in the entries, especially when the feature frequency 
is high. Entries tended to include 2 to 6 document features, 
and long bibliography entries were fairly rare (entries with 
13 or more document feature instances represent only 6% 
of the annotated corpus). Normal entries containing 2 to 6 
document features correspond to 2 to 4 sentence- or phrase- 
length entries. 
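
One way to obtain such an average relative position is sketched below. The normalization is an assumption, since the text does not spell out the formula: the i-th of n feature instances in an entry (counting from 0) is placed at i / (n - 1), and an entry with a single instance contributes 0. Tag names other than OVERVIEW are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def mean_relative_positions(entries: List[List[str]]) -> Dict[str, float]:
    """Average position of each tag, scaled so 0 is the start and 1 the end."""
    positions = defaultdict(list)
    for tags in entries:
        last = len(tags) - 1
        for i, tag in enumerate(tags):
            # Single-instance entries are placed at 0.0 (an assumption).
            positions[tag].append(i / last if last > 0 else 0.0)
    return {tag: sum(vals) / len(vals) for tag, vals in positions.items()}


# Hypothetical toy data: each list is the tag sequence of one entry.
toy_sequences = [["OVERVIEW", "TOPIC", "ASSESSMENT"],
                 ["OVERVIEW", "ASSESSMENT"],
                 ["OVERVIEW", "DETAIL", "TOPIC"]]
averages = mean_relative_positions(toy_sequences)
print(averages)
```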

Examining the ordering data, it is quite apparent that 
some of the fields naturally occur before or after others. 
Overview sentences generally come very early in the bibliography
entry, and information on who wrote the entry (the
contributor) usually comes very late. Subjective assessment
or critique of a resource usually comes after an explanation
of the resource, and thus comes later in the summary. Ordering
among the features is quite variable, but it is obvious
that many of the features tend to occur either earlier (e.g.,
bibliographic information) or later (e.g., subjective assessment
or complicated types of metadata) with topical information 
filling in the space between. 
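
Ordering tendencies of this kind can also be examined pairwise. The sketch below counts, for every pair of features that co-occur in an entry, which one appears first; using the first occurrence of each tag is a simplifying assumption, and this is an illustration rather than the analysis actually carried out for this paper.

```python
from collections import Counter
from itertools import combinations
from typing import List


def precedence_counts(entries: List[List[str]]) -> Counter:
    """counts[(a, b)] = number of entries in which a first occurs before b."""
    counts: Counter = Counter()
    for tags in entries:
        first_pos = {}
        for i, tag in enumerate(tags):
            first_pos.setdefault(tag, i)   # keep only the first occurrence
        for a, b in combinations(sorted(first_pos), 2):
            if first_pos[a] < first_pos[b]:
                counts[(a, b)] += 1
            else:
                counts[(b, a)] += 1
    return counts


# Hypothetical toy tag sequences, one list per entry.
toy_orderings = [["OVERVIEW", "TOPIC", "ASSESSMENT"],
                 ["OVERVIEW", "ASSESSMENT"],
                 ["OVERVIEW", "DETAIL", "TOPIC"]]
order = precedence_counts(toy_orderings)
print(order[("OVERVIEW", "ASSESSMENT")], order[("ASSESSMENT", "OVERVIEW")])
```

A strongly lopsided pair of counts suggests a stable ordering preference, while near-equal counts suggest the two features are freely ordered.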

7. Corpus miscellanea 

Command-line utilities are also provided to modify, insert
and extract attributes from the corpus. The web-based CGI 
scripts used by the authors to build and analyze the corpus 
are also provided. 

The corpus will be made web-accessible to licensed par- 
ties. We would like to encourage other research groups 
to join in expanding the collection and annotation of ad- 
ditional bibliographic entries. 

7.1. Availability and copyright issues 

The corpus is available for academic and not-for-profit 
research, by request to the first author. A licensing agreement
is required in order to acquire the corpus and is available
on the Columbia Natural Language Group's "Tools"
page.² An annotation guide, explaining the annotation tagging
guidelines in more detail, will also be made available.

As the bibliographic entries themselves are mostly 
copyrighted by the individual parties that have authored 
the entries, we can only distribute the entries under the 
United States' Fair Use copyright exemption, which allows 
the copying or excerpting of copyrighted text for non-profit 
research and scholarship purposes. For-profit institutions
interested in acquiring the corpus should also contact
the first author for information. The delimitation and
annotations of the entries can be separated from the entry 
texts themselves using standoff annotations and can be dis- 
tributed; institutions can then follow up with individual au- 
thors for rights to the source texts. 
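
The idea of standoff annotation can be illustrated as follows: the annotations are stored as character offsets that refer to the entry text without containing it, so the two can be distributed separately. The record layout, field names, and half-open offset convention are assumptions, not the format used in the corpus.

```python
from typing import List, NamedTuple, Tuple


class Standoff(NamedTuple):
    entry_id: str
    tag: str
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive


def apply_standoff(text: str, spans: List[Standoff]) -> List[Tuple[str, str]]:
    """Re-attach each annotation to the text it refers to."""
    return [(span.tag, text[span.start:span.end]) for span in spans]


# The text would be obtained separately from the rights holder.
text = "This paper gives a brief history of ANCOVA."
spans = [Standoff("id26", "OVERVIEW", 0, len(text)),
         Standoff("id26", "MEDIATYPES", 5, 10)]
resolved = apply_standoff(text, spans)
print(resolved)
```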

8. Future work 

The corpus serves as a basis for our current research in 
corpus-trained natural language generation. In a high-level 

² http://www.cs.columbia.edu/nlp/tools.html



Document Features | # tag occurrences (tag frequency) | % entries possessing tag

Topicality document features - features based on contents of the body text

Detail (Quotations, extracted sentences, parts of a chronology, conclusions)

Overview (Generalized description of the entire resource, "This book is about Louisa Alcott's life.")

Topic (High-level list of topics, e.g., "Topics include symptoms, ...")

Metadata and document-derivable features - features that are domain- and genre-independent

Media Type (e.g. "This book", "A weblet", "Spans 2 CDROMs")

Author / Editor*

(e.g. "figures and tables")

(e.g. "based on a report")

(e.g. "is organized into three parts")

Collection Size*

(e.g. "for adult readers")

Contributor* (Name of the author of the annotated entry)

Cross-resource Comparison* (e.g., "similar to the other articles")

Size/Length

Style* (e.g., "in verse rhythm", "showcased in soft watercolors")

Query Relevance* (text relevant to the theme of the annotated bibliography collection)

Readability*

Table 3: Distribution of the document features in the 100 entry annotated portion of the corpus. Starred entries denote 
metadata fields. 



strategic component, we establish ordering preferences be- 
tween the document features to determine when in the sum- 
mary they occur. In a low-level tactical component, we find 
constraints on the lexical realization and phrasing of the
document features. We are also in the continuing process 
of refining our tagset (particularly in further differentiating 
detail sentences into particular subclasses) and collecting 
and annotating additional corpus entries. 



9. Conclusions 

We have presented our motivations for collecting a corpus
of annotated bibliography entries, as a means of studying
appropriate summary forms for documents in information
retrieval displays. Annotated bibliography entries are
constructed by abstractive techniques and display both indicative
and informative qualities. While topical, content-based
features are prominent and necessary in summaries,
guidelines have suggested that summaries should also include
metadata and critical document features. Our corpus



study has shown that these guidelines are followed in actual
annotated entries, and furthermore has quantitatively
assessed their importance and explored their internal ordering
within summaries of different lengths.

We have detailed the methodology used to collect the 
2000-entry corpus and detailed our annotation and docu- 
ment feature distribution across 100 randomly selected en- 
tries. The corpus is available for non-profit research use 
and we would like to encourage other researchers to use 
and contribute to this corpus as well. 

Table 4: Feature distribution across entries of different document lengths. Frequency of document feature given as entry, 
average relative position of feature given in parentheses (0 indicates the beginning of the entry, 1, the end of the entry). 
Document features listed in order of descending frequency in the annotated corpus. 